Load the data “data/bank_2.csv”
library(tidyverse)
library(funModeling)
library(corrr)
data_bank=read_delim("../data/bank_2.csv", delim = ";")
1 - Find the correlations among the numeric variables.
cor_bank=data_bank %>% select_if(is.numeric) %>% correlate()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
cor_bank
## # A tibble: 7 x 8
## rowname age balance day duration campaign pdays previous
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age NA 0.00297 -0.00865 0.0168 -0.00206 -0.00440 -0.00230
## 2 balance 0.00297 NA 0.0105 0.0224 -0.0139 0.0174 0.0308
## 3 day -0.00865 0.0105 NA -0.0185 0.137 -0.0772 -0.0590
## 4 duration 0.0168 0.0224 -0.0185 NA -0.0416 -0.0274 -0.0267
## 5 campaign -0.00206 -0.0139 0.137 -0.0416 NA -0.103 -0.0497
## 6 pdays -0.00440 0.0174 -0.0772 -0.0274 -0.103 NA 0.507
## 7 previous -0.00230 0.0308 -0.0590 -0.0267 -0.0497 0.507 NA
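The matrix above is symmetric, so every pair appears twice and the diagonal is NA. As a minimal sketch (using the built-in mtcars data instead of the bank file, and assuming a corrr version whose correlate() accepts quiet = TRUE), the redundant half can be dropped with shave(), and any single entry can be cross-checked against base R:

```r
library(corrr)

# mtcars ships with R; three columns keep the output small.
cars_cor <- correlate(mtcars[, c("mpg", "wt", "hp")], quiet = TRUE)

# Keep only the lower triangle so each pair appears once.
shave(cars_cor)

# Cross-check one entry (mpg vs wt) against base R's cor().
cor(mtcars$mpg, mtcars$wt)
```

The same shave() call can be appended to the cor_bank pipeline if only one copy of each pair is wanted.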
2 - Find all the linear correlations between the input variables and the output
cor_bank_2=data_bank %>% select_if(is.numeric) %>% correlate() %>% stretch()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
cor_bank_2
## # A tibble: 49 x 3
## x y r
## <chr> <chr> <dbl>
## 1 age age NA
## 2 age balance 0.00297
## 3 age day -0.00865
## 4 age duration 0.0168
## 5 age campaign -0.00206
## 6 age pdays -0.00440
## 7 age previous -0.00230
## 8 balance age 0.00297
## 9 balance balance NA
## 10 balance day 0.0105
## # … with 39 more rows
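Note that select_if(is.numeric) drops any non-numeric column, so a text target would not appear in the stretched table. One possible way to include it, sketched here on simulated data, is to recode the target as 0/1 and keep only its column with focus(). The toy data frame, the "yes"/"no" coding, and the deposit_num name are assumptions made for illustration, not taken from bank_2.csv.

```r
library(dplyr)
library(corrr)

set.seed(42)
# Simulated stand-in for the bank data (hypothetical values).
toy <- tibble(
  duration = rexp(300, rate = 1 / 250),
  campaign = rpois(300, lambda = 2) + 1
) %>%
  mutate(deposit = ifelse(duration + rnorm(300, sd = 150) > 250, "yes", "no"))

toy_num <- toy %>%
  mutate(deposit_num = as.numeric(deposit == "yes")) %>%  # 1 = "yes", 0 = "no"
  select_if(is.numeric)

# Correlation of every numeric input against the recoded target only.
toy_cor <- toy_num %>% correlate(quiet = TRUE) %>% focus(deposit_num)
toy_cor

# Long format with each pair listed once and the NA diagonal removed.
toy_pairs <- toy_num %>%
  correlate(quiet = TRUE) %>%
  stretch(na.rm = TRUE, remove.dups = TRUE)
toy_pairs
```

Pearson correlation against a 0/1 variable is the point-biserial correlation, so it only captures linear association with the target.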
3 - Find the most important variables according to the gain ratio (this may take a while)
# trick to make it run faster: take a sample
data_bank_sample=sample_n(data_bank,1000)
res_rank_info_bank=var_rank_info(data = data_bank_sample, target = "deposit")
## Warning in KL.plugin(freqs2d, freqs.null, unit = unit): Vanishing value(s)
## in argument freqs2!
## Warning in KL.plugin(freqs2d, freqs.null, unit = unit): Vanishing value(s)
## in argument freqs2!
res_rank_info_bank
## var en mi ig gr
## 1 pdays 3.411 0.195 0.19463516730 0.0747986747
## 2 contact 2.015 0.057 0.05688788373 0.0531448615
## 3 poutcome 2.159 0.051 0.05095001807 0.0421652953
## 4 housing 1.969 0.031 0.03093520064 0.0309441298
## 5 previous 2.513 0.043 0.04261050342 0.0274274906
## 6 month 4.027 0.079 0.07884737855 0.0253719285
## 7 age 6.295 0.082 0.08184351447 0.0152201445
## 8 loan 1.508 0.007 0.00719833180 0.0139823484
## 9 campaign 3.266 0.026 0.02597614846 0.0113154541
## 10 job 4.118 0.021 0.02130729548 0.0067876409
## 11 day 5.809 0.032 0.03158712250 0.0065254371
## 12 education 2.582 0.009 0.00919664433 0.0057739176
## 13 marital 2.343 0.007 0.00727194433 0.0053913876
## 14 default 1.087 0.000 0.00007547927 0.0008640825
## 15 balance 9.363 0.797 NA NA
## 16 duration 9.296 0.698 NA NA
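To make the ig and gr columns less of a black box, here is a minimal sketch of the textbook definitions: information gain as H(Y) - H(Y | X), and gain ratio as information gain divided by the entropy of the input, H(X). The helper names (entropy_bits, info_gain, gain_ratio_sketch) are hypothetical, and this is not claimed to reproduce the internals or the exact numbers of var_rank_info().

```r
# Shannon entropy in bits of a discrete vector.
entropy_bits <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]                 # avoid 0 * log2(0)
  -sum(p * log2(p))
}

# Information gain: H(Y) - H(Y | X).
info_gain <- function(x, y) {
  w <- prop.table(table(x))     # weight of each level of x
  h_cond <- sum(sapply(names(w), function(lv) {
    w[[lv]] * entropy_bits(y[x == lv])
  }))
  entropy_bits(y) - h_cond
}

# Gain ratio: information gain normalised by the entropy of the input.
gain_ratio_sketch <- function(x, y) info_gain(x, y) / entropy_bits(x)

y        <- c("yes", "yes", "no", "no")
x_good   <- c("a", "a", "b", "b")   # separates y perfectly
x_useless <- c("a", "b", "a", "b")  # carries no information about y

gain_ratio_sketch(x_good, y)     # 1
gain_ratio_sketch(x_useless, y)  # 0
```

Dividing by H(X) penalises inputs with many distinct values, which would otherwise get a high information gain just by splitting the data finely.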
Analyze the relationship between the input variables and the output ‘deposit’. If ‘input’ is not supplied, the function runs for all variables ;)
4 - Use the cross_plot function
cross_plot(data = data_bank, target = 'deposit')
## Warning in remove_na_target(data, target = target): There were removed 5
## rows with NA values in target variable 'deposit'.
## Plotting transformed variable 'age' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
## Plotting transformed variable 'balance' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
## Plotting transformed variable 'day' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
## Plotting transformed variable 'duration' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
## Plotting transformed variable 'campaign' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
## Plotting transformed variable 'pdays' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
## Plotting transformed variable 'previous' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
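For intuition about what the cross_plot charts summarise, the sketch below computes the share of "yes" outcomes per equal-frequency bin by hand with dplyr. The simulated toy data and the choice of five bins are assumptions; ntile() is used as a simple stand-in and may not produce the same cut points as the 'equal_freq' binning reported in the messages above.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in: older clients are made more likely to say "yes".
toy <- tibble(age = sample(18:80, 500, replace = TRUE)) %>%
  mutate(deposit = ifelse(runif(500) < age / 100, "yes", "no"))

rates <- toy %>%
  filter(!is.na(deposit)) %>%           # drop rows with a missing target once, up front
  mutate(age_bin = ntile(age, 5)) %>%   # 5 bins of (almost) equal size
  group_by(age_bin) %>%
  summarise(
    n        = n(),
    min_age  = min(age),
    max_age  = max(age),
    yes_rate = mean(deposit == "yes")
  )

rates
```

Filtering out the rows with a missing target before calling the plotting functions, as in the filter() step here, is also one way to avoid getting the same remove_na_target warning for every variable.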
5 - Use the plotar function (boxplot and histdens)
plotar(data = data_bank, target = 'deposit', plot_type = 'boxplot')
## Warning in remove_na_target(data, target = target): There were removed 5
## rows with NA values in target variable 'deposit'.
plotar(data = data_bank, target = 'deposit', plot_type = 'histdens')
## Warning in remove_na_target(data, target = target): There were removed 5
## rows with NA values in target variable 'deposit'.
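If more control over the charts is needed, roughly comparable plots can be built directly with ggplot2. This is only a sketch on simulated data (the toy data frame and its values are hypothetical), and it does not claim to match the exact appearance of plotar():

```r
library(ggplot2)

set.seed(7)
# Simulated stand-in: calls ending in "yes" tend to last longer.
toy <- data.frame(
  deposit  = rep(c("no", "yes"), each = 100),
  duration = c(rexp(100, rate = 1 / 200), rexp(100, rate = 1 / 500))
)

# Boxplot of a numeric input, split by the target.
p_box <- ggplot(toy, aes(x = deposit, y = duration)) +
  geom_boxplot()

# Overlaid density curves of the same input, one per target class.
p_dens <- ggplot(toy, aes(x = duration, fill = deposit)) +
  geom_density(alpha = 0.5)

print(p_box)
print(p_dens)
```

In both charts, an input is informative when the two target classes show clearly different distributions.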
6 - Use the rplot() function from the corrr package (google it) on the mtcars dataset. mtcars is already loaded in the R environment, just like iris.
mtcars %>% select_if(is.numeric) %>% correlate() %>% rplot()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
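A possible refinement, sketched below, is to reorder the variables so that strongly correlated ones sit next to each other and to drop the redundant upper triangle before plotting. It assumes a corrr version that provides rearrange() and shave() and accepts quiet = TRUE in correlate():

```r
library(dplyr)
library(corrr)

p <- mtcars %>%
  correlate(quiet = TRUE) %>%  # all mtcars columns are numeric already
  rearrange() %>%              # group highly correlated variables together
  shave() %>%                  # keep each pair only once
  rplot()

print(p)
```

Since rplot() returns a ggplot object, the result can be customised further with the usual ggplot2 layers.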